Welcome to the Festival of Genomics workshop on single-cell analyses

This is an R Markdown document that contains instructions and code for the examples used in todays workshop. The first few steps will check that you have all the packages and data files necessary to carry out all of the analyses.

Hour 1: Getting Started

Sara’s section …

Check that the Brain Atlas data files are present

The following code chunk assumes that Brain Atlas files have been downloaded and placed in the “BrainAtlas” subdirectory of a folder in your home directory entitled “FestivalWorkshopSC”. If you have downloaded these files to another location, either create a new folder and move the files there, or substitute the file path to where they are currently located for “~/FestivalWorkshopSC/BrainAtlas”. If you are using the RStudio Server instance provided by the workshop, change the first line that follows to setwd("/home/FestivalWorkshopSC/BrainAtlas")

setwd("~/FestivalWorkshopSC/BrainAtlas")
file.exists("cell_metadata.csv")
## [1] TRUE
file.exists("genes_counts.csv")
## [1] TRUE
file.exists("README.txt")
## [1] TRUE

If any of the preceding lines return FALSE, double check that you have set the correct working directory and that all download files have been placed in that folder. If they are missing, you can download the files here. Note that there will be more files than listed here (including alternate gene count quantifications, and metadata about data-driven cluster memberships as discussed in the paper “Adult mouse cortical cell taxonomy revealed by single cell transcriptomics”), but these are the ones we will make use of.

Read the Brain Atlas data files into R

The read.csv function in R is useful for reading in .csv (comma-separated value) files. First, we’ll read in the main data file genes_counts.csv to a data frame using this function and check its contents. The extra arguments to this function help to format our object so that we have row and column names, and we use character variables instead of converting to factors. For more details on these arguments, you can type help(read.csv). In the resulting coutns object, we have genes in rows (24057) and cells in columns (1679).

counts <- read.csv("genes_counts.csv", stringsAsFactors = FALSE, header=TRUE, row.names = 1)
str(counts[,1:20]) # restrict to the first 20 columns (cells) 
## 'data.frame':    24057 obs. of  20 variables:
##  $ A01101401: num  0 992 2.57 0 0 0 0 0 0 1 ...
##  $ A01101402: num  0 2287 177 0 0 ...
##  $ A01101403: num  0 492 0 0 0 ...
##  $ A01101404: num  0 1932 1 0 0 ...
##  $ A01101405: num  0 1425 2 0 0 ...
##  $ A01101406: num  0 130 3 0 0 ...
##  $ A01101407: num  0 2110 3041 0 17 ...
##  $ A01101408: num  0 955 101 0 0 ...
##  $ A02271433: num  0 326 0 0 0 349 0 0 0 0 ...
##  $ A02271434: num  0 933 1042 0 0 ...
##  $ A02271435: num  0 296 0 0 0 ...
##  $ A02271436: num  0 31 200 0 0 984 0 0 0 36 ...
##  $ A02271437: num  0 970 0 0 3 ...
##  $ A02271438: num  0 594 355 0 228 ...
##  $ A02271439: num  0 2290 3294 0 0 ...
##  $ A02271440: num  0 29 1 0 0 ...
##  $ A12101401: num  0 373 540 0 0 ...
##  $ A12101402: num  0 462 197 0 0 0 0 0 0 0 ...
##  $ A12101403: num  0 516 96 0 498 ...
##  $ A12101404: num  0 1019 0 0 0 ...

The cell_metadata.csv file contains 1679 rows (one for each cell) and columns containing information such as collection date, sequencing type, total reads, mapping percentage, dissection layer, and major/minor derived cell subtypes.

cells <- read.csv("cell_metadata.csv", stringsAsFactors = FALSE, header = TRUE)
str(cells)
## 'data.frame':    1679 obs. of  16 variables:
##  $ long_name         : chr  "A01101401" "A01101402" "A01101403" "A01101404" ...
##  $ cre               : chr  "Calb2" "Calb2" "Calb2" "Calb2" ...
##  $ collection_date   : chr  "11/18/2013" "11/18/2013" "11/18/2013" "11/18/2013" ...
##  $ sequencing_type   : chr  "hiseq" "hiseq" "hiseq" "hiseq" ...
##  $ total_reads       : int  23770190 9694719 5864322 22102121 24057147 24171169 22447919 20995719 9023705 23016406 ...
##  $ all_mapped_percent: num  93.5 92.9 90.5 93.2 93.1 ...
##  $ mRNA_percent      : num  54.4 45.7 48.3 51.4 51.1 ...
##  $ genome_percent    : num  30.1 35.5 34 33.8 32.1 ...
##  $ ercc_percent      : num  4.36 7.84 4.12 4.24 4.98 3.14 3.12 2.27 3.22 2.13 ...
##  $ tdt_permillion    : num  306 341 106 371 264 ...
##  $ major_class       : chr  "Inhibitory" "Inhibitory" "Excitatory" "Inhibitory" ...
##  $ sub_class         : chr  "Vip" "Vip" "L4" "Vip" ...
##  $ major_dissection  : chr  "V1" "V1" "V1" "V1" ...
##  $ layer_dissectoin  : chr  "All" "All" "All" "All" ...
##  $ color_code        : int  11 11 11 11 11 11 11 11 11 11 ...
##  $ short_name        : chr  "A200_V" "A201_V" "A202_V" "A203_V" ...

More detailed information about the files downloaded from the Allen Brain Atlas can be found in the README.txt file provided. Here is a peek at the contents of that file.

# This is a bash command, to be executed at the command line (not within R);
# Alternatively, simply open the README.txt in your favorite text editor to view its contents
head README.txt
## Description of files contained in this data download:
## 1.   genes_counts.csv: File containing read count values obtained from the RSEM algorithm for each gene (row) for each cell (column)
## 2.   genes_rpkm.csv: File containing the RPKM values obtained from the RSEM algorithm for each gene (row) for each cell (column)
## 3.   ercc_counts.csv: File containing read count values obtained from the RSEM algorithm for each external ERCC spike-in control (row) for each cell (column)
## 4.  cell_metadata.csv: File containing information about each cell profiled, including its nomenclature, Cre line of origin, dissection, date of collection and sequencing, and read mapping statistics
## 5.   cluster_metadata.csv: File containing information about each data-driven cluster, including its label, the corresponding label in Tasic et al. (Nat. Neuro, 2106), the primary cell class membership, and marker genes (including genes with widespread expression in the cluster, sparse expression in the cluster, and no expression in the cluster). 
## 6.   cell_classification.csv: File containing information about the cluster membership of each cell, including whether the cell is a "core" (unambiguously assigned to a single cluster) or "transition" (shares membership between two clusters) cell, as well as its membership score (from 0-10) for each cluster (labeled f01 to f49). 
## 
## To generate the count and RPKM data, 100bp single-end reads were aligned using RSEM to the mm10 mouse genome build with the RefSeq annotation downloaded on 11 June 2013.
## Raw fastq files are available at Gene Expression Omnibus, accession ID GSE71585

Check that the desired R packages have been installed

Once a list of desired R packages is finalized, can check that they are installed with

require(scde)     #bioconductor
require(monocle)  #bioconductor
require(sincell)  #bioconductor
require(scDD)     #github
require(ggplot2)  #cran
require(devtools) #cran
require(Oscope)

If any of these commands return a message that includes “there is no package called…”, then the package is missing and needs to be installed. Note that packages may be stored in one of several package repositories. The most popular are Bioconductor, github, and CRAN. For Bioconductor packages, for example edgeR, this can be done with the following code:

source("http://bioconductor.org/biocLite.R")
biocLite("monocle")

For CRAN packages, for example devtools, installation can be done with the following code:

install.packages(devtools)

For Github packages, for example scDD, installation can be done with the following code:

install.packages("devtools")
devtools::install_github("kdkorthauer/scDD")

Visualize major axes of variation in a PCA plot

# extract top 500 variable genes
gene.var <- apply(counts, 1, function(x) var(log(x[x>0])))
counts.top500 <- counts[which(rank(-gene.var)<=500),]

counts.pca <- prcomp(log(counts.top500+1),
                   center = TRUE,
                   scale. = TRUE) 
summary(counts.pca)$importance[,1:5]
##                             PC1      PC2      PC3      PC4      PC5
## Standard deviation     25.44755 15.98222 8.598395 7.371017 7.197588
## Proportion of Variance  0.38569  0.15213 0.044030 0.032360 0.030850
## Cumulative Proportion   0.38569  0.53783 0.581860 0.614220 0.645070
plot(counts.pca, type="l", main="Top 10 PCs")

color_class <- rainbow(length(unique(cells$major_class)))
plot(counts.pca$rotation[,1], counts.pca$rotation[,2], 
      xlab="PC 1", ylab="PC 2", col=color_class[as.numeric(factor(cells$major_class))], pch=20,
      main="PCA plot of cells colored by derived major class")

color_class <- rainbow(length(unique(cells$layer_dissectoin)))
plot(counts.pca$rotation[,1], counts.pca$rotation[,2], 
      xlab="PC 1", ylab="PC 2", col=color_class[as.numeric(factor(cells$layer_dissectoin))], pch=20,
      main="PCA plot of cells colored by Dissection Layer")

Hour 2: Normalization and Quality Control of scRNA-seq

Apua’s Section …

Normalization

Quality Control (QC measures)

detectionRate <- apply(counts, 2, function(x) sum(x > 0) / length(x))
hist(detectionRate)

Hour 3: Analysis Modules

Keegan’s Section …

Identify the highly variable genes

In this module, we will identify genes that are highly variable across the entire cell population. This will give us a subset of genes to focus on that is likely enriched for those that are driving heterogeneity among cellular subtypes.

For ease in downstream analysis with various R packages we’ll make use of today, we’ll convert the data.frames that currently (separately) house the counts and cell metadata into a Bioconductor object called an SCDESet introduced by the scater package. This object is a container that can hold raw and normalized expression values, along with the metadata for both samples and genes, in one place. We’ll also perform some basic preprocessing and normalization here to adjust for library size using the pool and deconvolve method in the scran package, as well as remove genes with expression in only a few cells (*** these steps may not be necessary if this was already carried out in Apua’s section… Can refer back to the section or remove if it is redundant ***).

library(scater)
## 
## Attaching package: 'scater'
## The following objects are masked from 'package:monocle':
## 
##     cellPairwiseDistances, cellPairwiseDistances<-
## The following objects are masked from 'package:plyr':
## 
##     arrange, mutate, rename
## The following object is masked from 'package:stats':
## 
##     filter
library(scran)
rownames(cells) <- cells$long_name
eset <- newSCESet(countData = counts, phenoData = AnnotatedDataFrame(cells))  

# normalize counts for library size using the pool & deconvolve method of Lun et al. (2016)
eset <- computeSumFactors(eset, sizes=c(20, 40, 60, 80))
summary(sizeFactors(eset))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.01716  0.69440  1.04300  1.22000  1.46100 12.03000
plot(sizeFactors(eset), colSums(counts(eset))/1e6, log="xy",
    ylab="Library Size (Total Counts in Millions)", xlab="Pooled Size Factor Estimate",
    main="Normalization factor versus library size")

# filter out low-abundance genes (at least 10 cells out of 1679 must have nonzero expression)
keep <- rowSums(exprs(eset)) >= 10
eset <- eset[keep,] 
sum(keep)
## [1] 18195
# use the size factors calculated above to normalize the counts - these get placed in the 'exprs' slot
eset <- normalize(eset)

Next, we’ll perform some calculations to help us identify genes that have high variability. To do so, we have to take into account the relationship between mean expression level and variance of expression level (also commonly referred to as ‘over-dispersion’ in the literature). Namely, that what we observe in RNA-seq count data is that genes with higher expression have higher variance. This is also a property of the Negative Biomial distribution, which is often used to model count data from RNA-seq experiments.

We use the trendVar function of the scran package to estimate the relationship between the mean and variance. This function fits a smooth type of curve to capture the trend in mean log-expression and variance log-expression, which will serve as an estimate of the baseline technical variability if we assume that the majority of the genes are constantly expressed (i.e. that there is no significant biological variability). Note that this is a rather large assumption, and it is not necessarily true in our experiment. How might this assumption be evaluated? It is ideal to independently estimate this relationship for so-called ‘spike-in’ data (where a small number of artifical transcripts have been added at known concentrations such that any variability is assured to be technical). However, these can be challenging to use and won’t necessarily capture all of the technical variability depending on where in the protocol they are added. Since we do not have spike-ins in our dataset, we proceed by estimating this relationship on the entire gene set, by using the use.spikes=FALSE option in the trendVar function.

Once this relationship has been estimated, the decomposeVar (also in the scran package) essentially removes the estimated technical variability component from the total variability to calculate the estimated biological variability. This is what we aim to use to find the highly variable genes. We also visualize the mean and variance log-expression relationship, along with the estimated technical variabilty fit.

var.fit <- trendVar(eset, trend="loess", use.spikes=FALSE, span=0.2)
var.out <- decomposeVar(eset, var.fit)

# plot the mean versus variance of log-expression, along with the technical variance fit
plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression", 
    ylab="Variance of log-expression")
o <- order(var.out$mean)
lines(var.out$mean[o], var.out$tech[o], col="dodgerblue", lwd=2)

Finally, we extract the top 2000 genes with highest biological variability to use in downstream analyses. We also take a look at the distributions of normalized log-expression values for the top 20 biologically variable genes. Note that the biological variability estimates provided by seem to be rather robust to outlier cells, as none of the genes with highest variability seem to be driven by outliers. Note that there exist methods to find significantly highly variable genes in scRNA-seq data, however, without some other independent estimate of technical variability, we don’t have anything to compare to in order to establish significance. What is an example of an independent measurement of technical variability?

# extract and examine the top 2000 genes by biological variance
top.hvg <- order(var.out$bio, decreasing=TRUE)[1:2000]
head(var.out[top.hvg,])
##            mean    total      bio      tech
## Gad1   6.558620 46.86994 33.37464 13.495299
## Slc6a1 7.155704 43.15470 31.00193 12.152772
## Cnr1   8.604368 33.41635 24.46439  8.951962
## Synpr  5.556287 37.48550 22.39621 15.089286
## Enc1   9.105725 29.57515 21.59476  7.980394
## Rgs4   8.230319 30.92008 21.20304  9.717040
# construct a new eset object that only contains the highly variable genes for downstream analysis
eset.hvg <- eset[top.hvg,]

# plot distribution of the top 20 highly variable genes
top20 <- top.hvg[1:20]
boxplot(t(exprs(eset)[top20,]), las=2, ylab="Normalized log-expression", col="dodgerblue", main="Top 20 Highly Variable Genes")

Scanning the names of the top 20 highly variable genes, note that Gad1, Cnr1, Arpp21, Sparcl1, Npy, Nfib, and Vip are all in the list of 228 well-known neuronal subtype/non-neuronal marker genes listed in Supplementary Table 9. What does this tell us about our experiment?

TODO: add heatmap of HVGs by cell subtype?

Identify differentially expressed genes

In this module we will identify differentially expressed genes between neuronal subtypes, as well as between neurons and non-neuronal cell types. As we have learned in this workshop, though scRNA-seq data shares many characteristics with bulk RNA-seq data, there are major differences which need to be accommodated in order to properly analyze it and fully exploit its advantages (namely the presence of dropouts, or zeroes, and increased heterogeneity from a combination of technical and biological sources). First we will explore the method SCDE which compares expression magnitude between groups of cells while adjusting for drop-out and amplification biases on expression magnitude by fitting error models for individual cells.

library(scde)

# remove the three cells with unknown major class
eset.hvg <- eset.hvg[,-which(pData(eset.hvg)$major_class == "Unknown")]

# find DE genes between excitatory and inhibitory neuronal subtypes
eset.neur <- eset.hvg[,which(pData(eset.hvg)$major_class %in% c("Inhibitory", "Excitatory"))]
group <- factor(pData(eset.neur)$major_class)

# fit error models, get post probs, etc...

# find DE genes between neuron and non-neuronal
# define a factor that separates neuron and non-neuron
neuron <- pData(eset.hvg)$major_class
neuron[!(neuron %in% c("Inhibitory", "Excitatory"))] <- "Non-neuron"
neuron[neuron %in% c("Inhibitory", "Excitatory")] <- "Neuron"
neuron <- factor(neuron)

Next we’ll explore the method scDD which also compares expression between groups of cells, but aims to detect differences in expression patterns that may be more complex than an overall magnitude change (or mean shift). Specifically, cells within each group may be captured at different expression states, as may happen if a gene is oscillatory, or if it exhibits stochastic, “bursty” expression dynamics. scDD can detect if these types of complex patterns differ across cell groups.

library(scDD)

# construct object to send to scDD

# find DE genes between excitatory and inhibitory neuronal subtypes

# find DE genes between neuron and non-neuronal

TODO: fill out detailed stesps of the scde and scDD analysis modules

Order cells by “Psuedotime” (temporal-spatial variation)

Just like we saw in the Highly Variable Genes module, it can be convenient to store all information about an experiment (measurements and metadata) in one R object. This is a recurring theme in the Bioconductor project, so many R packages on Bioconductor rely on constructs like these. The benefits are gains in efficiency and reduction of errors in associating metadata objects back to the measurement objects, and often many package developers use objects that are compatible across several methods. However, with the explosion in methods development for scRNA-seq analysis, we have not yet established a ‘standard’ object type for storing and analysing scRNA-seq data in R. The result is that packages developed simulataneously and independently have slightly different formatting requirements. So long story short, we have to convert our SCESet object from the scater package to a similar CellDataSet object for use with the monocle package. Don’t worry about the details here.

Our goal in this analysis module is to discover a latent trajectory of variation using Monocle which may represent spatial organization of the cells. Since it is recommended to use genes that are believed to be important in determining where cells are located in relation to eachother (the ‘ordering’), we use the dataset that only contains the top 2000 highly variable genes that we constructed a previous analysis module.

library(monocle)
# construct a CellDataSet object with our SCESet object that contains only the top 2000 highly variable genes
cset <- newCellDataSet(cellData = exprs(eset.hvg), phenoData = phenoData(eset.hvg))
class(cset)
## [1] "CellDataSet"
## attr(,"package")
## [1] "monocle"

The setOrderingFilter function requires that we select which genes to use in the ordering. Since we have already filtered our cset dataset to include only the top 2000 highly variable genes, ordered by biological variance, we can subset the top 100 genes to include the top 100 highly variable genes. Before applying Monocle’s ordering algorithm, we use the reduceDimensions function to perform dimension reduction on this set of 2000 genes. Finally, the orderCells function carries out the Monocle algorithm. Note that we need to specify the number of paths here, which represent the number of main cell fates (or states) believed to be present. Here we choose one, which represents the path along the cortical layers as linear. Note that this step can take several minutes to complete.

Since we have already demonstrated that major cell type (inhibitory neuron, excitatory neuron, and non-neuronal cell) is a major source of variation, in this module we will examine the trajectories within the major neuronal cell types. To do so, we create two separate CellDataSets by subsetting on the major class listed in the phenotypic metadata and carrying out the Monocle algorithm separately on both. We also subset on the cells that have a major labeled dissection layer other than “All” (lower versus upper in Inhibitory neurons; L1-L6 in Excitatory neurons).

options(expressions = 500000) # 'under-the-hood' option; need to execute if using OSX or Windows due to a limitation on C stack size

# Run Monocle on subset of Inhibitory neurons
cset.inhibitory <- cset[,phenoData(cset)$major_class=="Inhibitory" & 
                         phenoData(cset)$layer_dissectoin %in% c("lower", "upper")]
cset.inhibitory <- setOrderingFilter(cset.inhibitory, ordering_genes=rownames(cset.inhibitory)[1:100])
cset.inhibitory <- reduceDimension(cset.inhibitory, use_irlba = FALSE) # Reduce dimensionality
## Reducing to independent components
cset.inhibitory <- orderCells(cset.inhibitory, num_paths = 1, reverse = FALSE) # Order cells

# Run Monocle on subset of Excitatory neurons
cset.excitatory <- cset[,phenoData(cset)$major_class=="Excitatory" & 
                         phenoData(cset)$layer_dissectoin %in% c("L1", "L2/3", "L4", "L5", "L6", "L6a", "L6b")]
cset.excitatory <- setOrderingFilter(cset.excitatory, ordering_genes=rownames(cset.excitatory)[1:100])
cset.excitatory <- reduceDimension(cset.excitatory, use_irlba = FALSE) # Reduce dimensionality
## Reducing to independent components
cset.excitatory <- orderCells(cset.excitatory, num_paths = 1, reverse = FALSE) # Order cells

Next we plot the spanning tree to visualize the cell ordering projected on the first two components of variation. We color the cells in this space by dissection layer see if we have recovered any of the spatial structure of the cortical layers. We also color the cells by Cre reporter line. Did we recover any spatial organization in these two subsets? Does there seem to be variation by Cre reporter?

# plotting by various factors
plot_spanning_tree(cset.inhibitory, color_by="layer_dissectoin") # plot spanning tree

plot_spanning_tree(cset.inhibitory, color_by = "cre")

# Excitatory
plot_spanning_tree(cset.excitatory, color_by="layer_dissectoin") # plot spanning tree

plot_spanning_tree(cset.excitatory, color_by="cre") # plot spanning tree

How does the spanning tree change if we allow a different number of paths? How does the spanning tree change if we include different gene sets?

Identify genes with oscillating expression patterns

library(Oscope)

TODO: add illustration of Oscope